Hmm It's with cameras. You look quite funny at the moment, Tim. Nice. Mm yes. Yes, perhaps yeah, but I mean they're all very very similar, all those X_M_L_ files. I had um a look at the um for example A_S_ Mm. Take Mozilla. Yeah, I think that would be m I mean uh for me it's quite difficult to say o on, you know, w um what scope one annotation would have. I think it would be certainly more than one word and I think it would work best with utterances or segments. Yep. I think just the segments as they're segmented in the for example in the point se point six five. I think so. And you would s probably still need um your values for the words, um perhaps for for um the keywords that are displayed when you click on something. Yeah, but that's another representation then, I think, f for the importance measure. Should yeah. Um po Yes. No, it's not. It's um for example, there's uh B_D_B_O_O_ one C_ point six. Uh I can't see anything very much, but yeah, that's that's it. Can you use a bigger font? Okay, so here there's for um each segment um uh that's really all segments, I mean it's it can be words or topics or um anything I think. Um that's why th it's for example, time provenance dialogue act or time provenance segment or time provenance No, but it yeah. Yeah, it p Oh, well some of them point to the words, other others point to the um dialogue acts. So that's kind of the global thing that ties together other things. Um yeah, they have their NITE I_D_, the time The segment is what is displayed in one line um by NITE. Yeah, it looks uh uh the segment thing looks in the into the words X_M_L_ file and the dialogue acts um thing. Yeah, I think it's in a different Um No, their their annotation of dialogue acts, and one um segment can have or one um of those word strings that are presented in one line can have several um dialogue acts annotated on it. That's why Yeah, it's from words. Yes. That's just below. 
But I think these segments are perhaps not exactly what we are looking at, because that's just o one tying all the others together. And the information or r I mean we are going to create a file that looks more like um the words file. I think. Yes. Yeah, they are. Um to discourse marker, DIS F_ marker fi point five. Yeah, that's there no no, up up. Go up. Yeah, right. Yeah, this is all the speaker D_. You and the speaker D_. Yes. Um you can see whether there's somebody else or not. You can see that in the w um W_ point um first it's W_ point um five uh fifty four and then it's seventy six, so there are um this number of utterances of other speakers in-between. Yeah, I think we would also need two of those. Yeah, I think we would have to have the same structure, one that points to it, where we tie together, all our information. So we would perhaps have to make one that's similar. Yes, and uh then another one that would perhaps give the actual probability value. Yeah, it could could probably be in the same file, I don't know. But if you have several layers, then you can't represent them on the same file. Yeah, in segment point one. Um there line below. Yeah, that's what I meant actually, yeah. Mm-hmm. Um But I I mean there's not all the information that is in in the c corpus in there in the segments file. I think that is just the um things that are loaded every time, but we have lazy loading, so it can load more than that. Perhaps we could just try to cope. Um, you know, to make our file one of those lazy loaded. But I mean I've no real idea how that works. Yeah, but yes, but it has um, you know, a basic thing that it loads, I think. So for example every time it loads the segment things, it can't it can't not display words, I think. And it eras I think so. That's how I understood it. Because otherwise in the in the other things there is no information about what the participant name is and so on. 
So in the segments file it's really the very basic that have to be displayed for any thing.? Yeah, I thin I mean it wouldn't be difficult to um create a file of the p sh of that shape. I mean that would be no problem. Um yeah, I I don't know how about you with your words, but um for my segments were the um F_ zero measurements and all those values I get from there. Um I always store the beg uh start and end time of everything I calculated there, so I would just have to pu put it uh yes. Yeah, it is. Uh actu but actually my segments are not always the same as their segments, because and their segments, there are um Pause um pauses sometimes. A Um yes, but uh yeah, I mean one segment of theirs is sometimes two segments of mine, that's just what I meant. Yeah, but for example I made a file that contains just the um start times. It could contain end times and words for all speakers of all meetings. I mean that's just the same thing as I gave, yeah, as I gave to you. But um uh displaying also a start and end times of every word. So you could perhaps use it to match um it with your additional information you got, couldn't you? Perl, a Perl script, yeah. No, no. Mm yeah. But would you then not get the typical words for every speaker? If you Because it would then say how it find that it's that speaker. Yeah. It's probability, yeah. Or or value, yeah. So, could you put the more information back in then? Yeah, but the topic isn't it? That's a topic information. Mm-hmm. Yeah. So how sure or unsure you are about the what's following our b our context measures, yeah. Yeah. But do we have enough informa enough data for that it gives us sensible things? Because I mean the the words we're look really looking for appear not too often, and if they appear five times in the meeting and they have each time a different, you know, differ some different surrounded words, perhaps we have not enough data. I don't understand what you Yeah, okay. Mm-hmm. 
But that would give the same value for one word every time I mean for for a specific word. Yeah, because then it would be quite easy to re-integrate it into such a Okay, but then it would also be possible to re-integrate it. It would be more dif I mean it would be a bit more work, but then Yes. Um and what well, what score would a word get that just occurs once in all the corpus for example? Even though there are all the other same topics where it doesn't occur? Mm-hmm. So would it be higher-scored than a word that um occurs in every um sequence of the same topic? Yeah, I mean um we have uh s uh our several topics. For example we have twenty topics, and um for one of these topics there are five occurrences, and we have one word that Yes, but across meetings um there will be the same topic several times. Ah. It wasn okay. I thought that was what you were doing. Yeah, but Yeah, but you make also um segment similarity, I think that was I I thought that would And they they are not related then to other blocks of Okay. I um Oh, okay. I misunderstood that then. So yeah, how would it work? I mean what would um the his information then be useful for in r Rainbow? Um I thought the point about that was that we would put um into s um serial categories all the segments of the same topic across meetings. Okay. Yeah, but but then it does okay. Mm-hmm. Mm. Mm. Yep. Yeah. Mm yeah. Yes. Yeah, I mean I haven't really decided on uh how to really get the information out of this now, because um when I have for example um increased speaker overlap, that applies to several turns of course. But I could give um the information about, yeah, how how important this is to each of those segments. Yeah, pros um but you proposed um we we should break that down to have um the smallest um smallest um unity um of time duration should be one word. I mean, that would be because that would be what naturally came out of her thing. 
And Yeah sure, but I could then if I if I have um the value for a segment, I would perhaps just give all the important wor or mm yeah no. Yes. Mm-hmm. Yeah, I would say that someone have to break it down, aye? I would probably somehow have to break it down to to that level, yeah. Yes, sure. Yeah, probably somehow it should work that way. Yeah, but then okay, but th at that moment it would be better for me to just make it on a per-segment basis right way, and let her adapt it also to the segment basis. Yeah. Yeah, I mean mine are only more, but they have the same start and end point. Sometimes they're two segments and one with a gap in-between, but um the have um th th they don't overlap. No no, they don't, but um Um they're I mean my segments overlap in the same way as theirs do. But sometimes I have um split one of the segments into two segments, but it's easy to match the start and end times to map it to their segments. I uh hope so, yeah. Mm-hmm, yeah. That would be quite difficult, but yeah. Yeah, I mean that's what I'm already doing with my s uh separate Yeah, it's tells you even which words it is. Yep. Should be possible. Yep. Yeah, I hope to have some value quite soon, but um I just worked um I mean I just calculated the values values for um the average F_ zeros and um they're what's c I mean I didn't have a look at the data yet, but um they vary quite widely even for the same speaker across meetings. So one speaker had um an average of about a hun one hundred um in one meeting and one hundred sixty in another meeting, that's really that's strange. I mean perhaps it's because of laughter or something, that's what I was thinking, but I didn't have I'll probably look at it. Yeah, that's what I read in one of those papers uh as well but Yeah, but it could mess up things quite considerably. Yeah. But uh ar yeah. But all this is quite, you know, data-intensive. 
I when I um let the um just calculate the average F_ zero levels, it I think it needed more than half an hour, considerably more than what have half an hour f to delay it for our meetings, because um yeah, m I mean we have seventy five hours per, yeah, av average of six speakers, and they're measured every O_ point O_ one six seconds. And that gives us quite a lot of values. Yeah, what I did at the moment is um I um, yeah, I got the F_ zero values from for each speaker for his headphone, and I only take those um the w uh that were recorded at the time where he was actually speaking. And for those I calculated um the average. Zero. So that gave me um now one value per speaker per meeting. No no, that's just the average. No, that i yeah. Yes, I mean I r just uh needed to have this value now to relate um m how yeah. Yeah, I think. Yeah. Um amusement and so on is yeah. Yeah. I I I mean it I actually have to look at the data what causes these um Because it's quite funny to have a m male speaking at two hun two hundred um. Yeah, sure but Mm-hmm. Yeah, and I am at the moment fiddling around with my data and not quite seeing how I get to a sensible abstraction level, you know. From my very Yeah. Yeah. So I mean you will be happy with some data, even if it doesn't make much sense. Try uh try yeah. Mm-hmm. Yeah, if it's already on segments base, that's not too much.
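[Editorial sketch] The averaging described above (keep only the F0 frames recorded while the speaker was actually talking, then take the mean, giving one value per speaker per meeting) could look roughly like this. The 0.016-second frame step comes from the discussion. The function name, the input shapes, and the rule that non-positive F0 values count as unvoiced are assumptions made for illustration, not the speaker's actual script.

```python
from bisect import bisect_right

FRAME_STEP = 0.016  # seconds between F0 measurements, as stated in the meeting


def mean_f0_while_speaking(f0_values, speech_intervals, frame_step=FRAME_STEP):
    """Average the F0 frames whose timestamp falls inside a speech interval.

    f0_values: one F0 value per frame; frame i is taken at time i * frame_step.
    speech_intervals: sorted, non-overlapping (start, end) pairs in seconds.
    Frames with F0 <= 0 are treated as unvoiced and skipped (an assumption).
    Returns None when no frame qualifies.
    """
    starts = [start for start, _ in speech_intervals]
    total, count = 0.0, 0
    for i, f0 in enumerate(f0_values):
        if f0 <= 0:
            continue
        t = i * frame_step
        # Index of the last interval that starts at or before t.
        k = bisect_right(starts, t) - 1
        if k >= 0 and t <= speech_intervals[k][1]:
            total += f0
            count += 1
    return total / count if count else None
```

A mean is sensitive to outliers such as laughter, which may be one reason the per-meeting averages vary so much for the same speaker; a median over the same frames would be a small change to try.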
Hmm? This isn't supposed to look like just Ah, sod that. Okay, there. Group K_. Da. Yeah. Yeah, I guess. Whatever. It's yeah, I never get this, so what's the point of if this is strapless if this strap's supposed to go behind m I obviously doing this wrong. Obviously failing this like whatev calling centre and tra Like this and then yeah, you know what the shit has to do now. Okay. Yes, so um actually I just realise I don't have power, let me just switch on the other which is gonna run out. Okay. So there's probably not much to talk about at the moment in terms of like talking about each other's stuff, I mean beyond what we've been talking about yesterday. So what I've been thinking is maybe if we try to really make sense together of the X_M_L_ format to be sure that we can all produce the data that we are producing in a way or at least I mean I guess my data I'll probably load in from completely different way anyway because it's matrix, but all this stuff that goes with annotation, that we have it in the right format. Maybe if we just sort of pool what we know about the X_M_L_ format and try to make sense of it. I don't at all know how to go about this. Um what I did just now is I downloaded Um I downloaded some files, some X_M_L_ files and I was thinking if we maybe just should go through some of them and like try to see if you understand the general structure and pool together what we think about the structure, what we think that things mean. And then talk about how the annotation for information density and stuff should maybe be structured in a similar way. Or but I don't r I don't really have an idea like what, other words, what we would do today in this meeting. I mean What's that part? Mm-hmm. Mm-hmm. Yes, so does anyone have a good idea where to start like which what so what is it we talk about getting in at the moment? We're basically just 'cause the two of you will merge your data together, right? 
In some way like i in the end it'll result in one annotation. So the only thing we're trying to tie in is just an additional annotation which is similar to the annotation about what what would it be s most similar to. Simply to the segmentation information or Okay. I'm not sure it Ah. See, that's what I was thinking. Not sure what software I have at the moment, which would be hello? Um Yeah. Ah. Does anyone know any good standards software displays X_M_L_, I'm not sure. Yeah, but I'm afraid I only have Firefox and Firefox doesn't have the nice X_M_L_ viewer anymore. I think I don't have Mozilla.. No. I can probably just Emacs them. But But going from from a word business, wouldn't it be easier to then go back and calculate for each utterance? You know, wha wha each turn, so if we Yeah. So d Does anyone does anyone know in which file the actual, like what's it called, the splitting into the utterances? Is is called is that Is it What does it end with? Does it with with SAX? Does that look like it? Okay. Be nice. Let's actually just see if maybe K_ right. Oh this is how much time you spend just getting the right software going. Is that better to see? I guess yeah, white and black is better to see. So just see if I can view Okay. So This this But what is this? I mean this doesn't contain any content. Like any text. Oh wait, somebody s s somebody explain that slowly to me. So So what do all these things in this file have in common? What what are they? No but w like uh so they are they are segments in what sense? Wh what's definition of a segment then? Okay. And and that's usually dialogue what what's difference i between provenance dialogue act and between segmenter? I don't A dialogue act uh a dialogue act annotations of dialogue act. So is that just another representation of text? Just see, can we have this files somewhere. So the stuff that's that's called segment here that do we know which file this is pulling from? 
From yeah, it's it's c it from words X_M_L_, is that it's saying here? Let me see if I have that words X_M_L_ file. So let's for example look at Okay, let me just try to get them all to a different screen. Uh uh is that what we were Sorry, I'm just I'm trying to arrange them, so that we have one above the other, so to make some sense of that. So no. Here we had Um I'm just trying to understand what this segments file is about. Like it's it's referencing here it's referencing to To a certain position in the in the words file. So it's that non-vocal sound one here. That's the mic noise, okay. But what's the does it always just oh no, it's it's referencing to several word structures. So this would be basically be a a sequence, right? So that's from W_ fifty two. Are they not ordered or oh yeah, from W_ fifty two down to W_ what are we talking about? And yeah, w sorry. Okay. So this so this is somebody saying it it doesn't. Do we have any confirmation do we actually see from this file who's doing it, just out of interest? Oh. Sorry, in speaker D_'s file. So this is where he'd like time stops. And then it only starts at a later time, because oh, you don't see when I'm pointing at screen, it's just for this. So then it starts here this time against speaker D_ because in-between there's somebody else. Okay. Yeah. Okay. Okay. So the the file with annotations for information density, what would that be like? Would that be like the segments file, giving a start word and end word, or or would that How would that be to best tie in with the system? Sorry? Something like something like this or like where you give sort of a reference to a beginning and an end position. Well that would be in the same file, wouldn't it? It would just couldn't we have a like uh I guess we could introduce a different um We have for now if we just do something which for every segment in here attaches actually can we not tie it to the segments file? 
Would that not be the way how they would want it to have to do as if we in the end, if we want to attach to each of those segments one number for now. And those segments have uh do they have an I_D_? Yeah, I guess they do. So this is the I_D_ of a specific segment, right? Yeah, so wouldn't it I mean the problem is they've me haven't looked at the exact inner workings of their of their engineered. But you'd think that the easiest way and the way that how it's intended to be would be just, if we have here a link to the segment, a like an I_D_ for that segment, that we just create another file which links to the segment and then has an additional value which is the number. Yeah, yeah. Yeah, oh sorry, I I thought we were talking about having two files. I was think yeah, I I think that I I understood wrong. I thought you were wanting to have two different X_M_L_ files, with one the reference and one just th just the number. And I was thinking that's probably what you meant, just um having like for each sort of of our segments having just the I_D_, which is referencing to these segments here and another attribute which is the the value. So this whole information we would then store in this Ah, no, that I don't have access to because I didn't download like there's this one meta information file where it describes the structure of all the files and describes which um which attributes they bring in. So we would add that to that file? Saying that sort of we bring in information density. And then we would create the file of that type, which we probably couldn't call it segment. I'm not sure. We probably might have to have a different word for it. I'm not sure if it there's any trouble with it repeating. In the existing segments file. Yeah, I reckon actually if we make a copy of it What uh what are you saying about the w I th doesn't the lazy loading apply to everything? I mean that it sort of dynamically loads. 
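[Editorial sketch] The proposal above, a separate file in which each entry carries the I_D_ of an existing segment plus one extra attribute holding the value, can be illustrated as follows. The element name `density`, the attributes `segment` and `value`, and the example I_D_s are all hypothetical. NITE X_M_L_ has its own linking conventions and a metadata file that would need a matching declaration, and neither is modelled here.

```python
import xml.etree.ElementTree as ET


def build_density_xml(values_by_segment_id):
    """Return an XML string with one element per segment I_D_.

    Element and attribute names are placeholders, not the NITE schema.
    """
    root = ET.Element("density-annotations")
    for seg_id, value in values_by_segment_id.items():
        ET.SubElement(
            root,
            "density",
            {"segment": seg_id, "value": f"{value:.3f}"},
        )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical segment I_D_s and density values.
    print(build_density_xml({"seg.1": 0.25, "seg.2": 0.8}))
```

Keeping the values in their own file leaves the existing segments file untouched, which fits the concern raised above about copying or renaming it.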
You think it i it l it loads the whole of the segments file every time. Okay. The segment at the moment is split up over Hmm. No but I think it's too early to really like discuss that in detail, because we don't at all understand at the moment how the internal data structure like how the loading works. So maybe if we g if we go ahead, do you think it would be possible for you to do an like something like segments or maybe just a copy of segment which has an attribute for for each segment, um but it is like with a with a value with a density value. Would it be easier though, because all your methods are sort of not working with their whole time frame structure there. So would it be easy for you to to tie the things together. Like if you're doing it on on the word basis here with those words, that in the end you then tie it back in into the right segment here. I mean you probably have to do a bit of Hmm. You're doing it time-based at the moment. So this isn't directly having time references, but you can get the oh, it is actually here. Okay. So if you slice it up by time, you'd probably be able to just like attach like just some attribute of info val just f to each of those segments. Mm-hmm. 'Kay. Um I don't understand enough of what their data structure is. This probably would be a lot easier would I really understand how they are handling the data internally, 'cause then we c then I could say oh, it's easy to just tie it in if you just have it time-stamped that just reference by words and stuff. But t at the moment So you would you're doing i you're doing a word by word base. Mm-hmm. Okay. But every word does does the word picking, like does it always have the same information value in your thing i does it depend on its position like It be d depends on the position. Hmm. 
And you're doing this via some software that's like external software, so you can't So what if you if you if you get the results from that software and you go back over it then with that file sort of, you write an algorithm which which then goes back because they they're in the right order still and stuff, right? So that shouldn't be a problem. Oh but this is by speaker here which makes it slightly more difficult. The problem is that actually NITE X_M_L_ probably provides a lot of the tools that we'd need to do that. In in a temporal sequence. But but would the information density algorithm still make any sense if you split them up, because I mean the whole I thought the whole thing is that you look at the frequency of a word in that specific how often a word inc occurs in a certain topic versus how often it occurs over the whole corpus and that from that it calculates. Mm-hmm. Yeah, I have a gut feeling it's not a good idea to split it. Yeah. No, but I mean it can't be that difficult. If you if you already have you have in the right order all the words with a with a score to them, and you have a file which has each word and a time stamp. So those two tied together have each word and its time and and and and its proba and its value. Yeah. So they're they're at the moment so It appears to me that Rainbow was made for something quite different. Hmm. I actually like um um i are you actually sure that Rainbow is doing a measure, like is returning a measure of what we are trying to measure, 'cause it it seems to me that it's just it sounds like something quite different in in d many aspects. Mm-hmm. And w Yeah. Yeah. But where do you have the original category information from? Yeah. Yeah. So where do you where do you have them from at the moment the split up? In i in the t in the topics, in the in the human topic s um so you've split them up by topic at the moment. Yeah, but Okay okay okay. 
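[Editorial sketch] Tying the two lists together as described above (all the words in order with a score, and a file giving each word with its time stamp) amounts to walking both lists in parallel. The tuple layouts and the function name are assumptions. The check that both lists contain the same words is an added safeguard, not something stated in the meeting.

```python
def attach_times(scored_words, timed_words):
    """Merge scores with timestamps, relying on identical word order.

    scored_words: list of (word, score) pairs.
    timed_words:  list of (word, start, end) triples, in the same order.
    Returns a list of (word, start, end, score) tuples.
    """
    if len(scored_words) != len(timed_words):
        raise ValueError("the two lists have different lengths")
    merged = []
    for i, ((w1, score), (w2, start, end)) in enumerate(
        zip(scored_words, timed_words)
    ):
        # Fail loudly if the external tool dropped or reordered a word.
        if w1.lower() != w2.lower():
            raise ValueError(f"mismatch at position {i}: {w1!r} vs {w2!r}")
        merged.append((w2, start, end, score))
    return merged
```

Because the word files are organised by speaker, the timed list would first have to be put into the same order the scoring tool saw, for example by sorting all speakers' words on start time.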
'Cause I was just thinking that doesn't make sense at all, but yeah, if you if that's just while you're waiting. Okay. Have you ever like looked into different ways of calculating, 'cause I was just thinking like I mean the for example the the infor um what's it called, the entropy calculation, is that she boxed it under the simple calculation that you could probably write the script in no time at all and Yeah, but I'm also just like I think that probably the entropy value at the moment for a word is closer to what we're at the moment looking for. I can just like k k I can sit together with you for twenty minutes and just show you the entropy code that I wrote for my other project and it probably and we should work together because it's we used t we have to we'd use the same matrix as I'm using in my latent semantic analysis, you'd use to calculate entropy scores. And then we'd have um a score which actually which would be the same for the word in each position. So in that sense it's doing something a bit different, and like w basically the score that I'm talking about is a conditional entropy score which just checks how much information, the fact that there's one word tells you about what would be the next word. But that's a relatively good measure of whether that's a very specific word, in which case they are usually words which tell you quite a lot or a very general word which usually doesn't tell you quite a lot. Um it it's basically the it's the standard entropy formula. And you sort of you you Yes, yes, it has. I mean, I think the official le sort of the official description of what it tells you is um how much that like the fact that a given word occurs tells you about what's the next word's gonna be. 
Which doesn't sound too exciting, but it it just works out in the way that words which are promiscuous and which occur with everything all over the place have very low a scores on that, and also usually end up being the words which are p Least like expressive, and l contain less information. Yeah, function words or just very general nouns. Pr probably like what whatev for example the word computer in that context. You could imagine it to f like be in all in all the contexts. Yeah, so we would we w wouldn't do it by word, we would sorry, okay, um I was I was getting that w actually, sorry, I was getting that wrong, I was getting it from what I did my project. Now in this case, we would do it by per mee words per meeting. So Hmm. Yes, it it's probably doing it's probably doing quite the same thing in the end, but I'm just saying like with that thing you would easily have an algorithm which at the moment provides you for each word with a score which we can use. Um no um, I was I was I was describing the wrong thing. In this case we wouldn't be doing how much it tells you about another word. In this case we would be doing given that you know a word, how good is it at predicting from which um specific topic that was. So that would yeah, in that sense it's the same thing here. Yeah, yeah, for a specific Yeah. The same word the word yesterday would be would have the same score all over the place. Okay. I c But but your category thing depends on that we not just have topic segments, but also that these topic segments we have them in categories. You you do wouldn't you need several documents for each category? Or several segments for each category. Do but will it word without that at all? 'Cause Yeah, I mean s so in our case basically every every s topic would be its own category. And the question is does the algorithm still make any sense in that I don't understand the algorithm enough for that. 
But what I'm really like because the entropical um calculation is so simple, maybe we should look into making that score just as a preliminary score that we have. Like it it's a very it gives you like I've looked at the result, it gives you basically something in the end which vaguely tells you just whether a word is a very specific word or a very general word. And I like there is some hope that probably having just sentences where there's lots of very specific words, if you mark them as being more interesting than the words which are only very general words, that they would get us somewhere. Probably one point zero is very high information value. Yeah, 'cause this would Yeah, in in a sense I mean this is a bit like the what like document frequency over total frequency. Measure it sort of just going by What do you mean every sequence of the same Mm-hmm. Within th within the topic, so like topic we had to when you say topic, you mean like just like from from a beginning to end point, like within one meeting there are several topics? But we don't have that information anywhere, do we? But but you you are segmenting. Well, I'm I'm doing one on segment similarity in the end, yeah. I'm doing like finding similar segments, basic latent semantic analysis. But But like f for now, like your segmentation is just splitting a meeting up into different blocks ver Not not from what Colin is doing from what I no. It's only like I'm writing an algorithm that which then tries to also again based on word p occurrence patterns try to link together maybe different ones of those. So So that yeah. Yeah, I'm also a bit b like I'm not a hundred percent sure about Rainbow being the right thing, 'cause it seems that Rainbow does in its structure quite rely on having different examples of the same category sort of in in a way. 'Kay. 
You know what, as a byproduct of my L_S_A_ I'll provide um a vocabulary like sort of a dictionary which for each word gives an entropy score entropy score, which just tells you of how much information the presence of a word tells you about which topic it is. Not which category, like I'm not I'm not lumping together separate topic segments into categories. But just like how much this word tells you about wh how likely that w the occurrence of that word makes it that it's a specific segment topic segment. Which is some measure already of how widespread this word is versus how specific f to a certain segment that is. And I'll just provide that, because that's just not much more work than just the usual thing. And then we can see how we can tie that in with the other stuff. So if you um keep on working on Rainbow meanwhile and try to find a way how to tie your Rainbow stuff into some way that we can attach it to a certain time segment. I'm just thinking, wh if It it used is each word completely unique, like sort of does it treat each word, each occurrence of word, as a completely unique event. Or does it, I mean, no, it it has to, I mean, basically the the form of the word is important, right? We can't just replace the word by an arbitrary string. Because it looks if the same word occurs again and stuff. Yeah, it it has to work with a m yeah, well I think it's a stupid question, like it it it has to work on on on the word, like on What I was thinking is whether if we replace the word by something uniquely id identifiable, then it wouldn't make a different which order it is. But that wouldn't work because it needs the word, because that's all it's working on. It's the word and that looks if that word occurs again and versus how often that word occurs in other context, right? So we can't attach some type of information to the word, just to the word string itself, like making an underscore, making the time or something, that wouldn't work. 
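[Editorial sketch] One way to read the per-word entropy score described above is: look at how a word's occurrences are spread over the topic segments, compute the entropy of that distribution, and rescale so that 1.0 means the word points to a single segment and 0.0 means it is spread evenly over all of them. This is an assumed formulation chosen for illustration. The speaker's own code, which works from the L_S_A_ matrix, is not shown in the transcript and may differ.

```python
import math
from collections import Counter, defaultdict


def topic_specificity(segments):
    """Score each word by how strongly it indicates one topic segment.

    segments: list of token lists, one list per topic segment.
    Returns {word: score} with score = 1 - H(segment | word) / log(N),
    where N is the number of segments. Every occurrence of a word gets
    the same score, wherever it appears.
    """
    n = len(segments)
    counts = defaultdict(Counter)  # word -> occurrences per segment index
    for idx, tokens in enumerate(segments):
        for tok in tokens:
            counts[tok][idx] += 1

    scores = {}
    for word, per_seg in counts.items():
        total = sum(per_seg.values())
        entropy = -sum(
            (c / total) * math.log(c / total) for c in per_seg.values()
        )
        scores[word] = 1.0 - entropy / math.log(n) if n > 1 else 0.0
    return scores
```

Under this formulation a word that occurs only once in the whole corpus gets the maximum score, so the measure largely tracks rarity, and a minimum-count cutoff may be needed before the scores are used for highlighting.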
But would it have that in the untruncated version then, like would it s would the output be the untruncated version? It will probably um no, I don't think it would. Yeah, for now, I mean really just like Yeah, I think if we work together on an on an entropy based score. It's let me see if I can demonstrate T I mean, let's just keep on talking meanwhile and I'll try to start that up. It's it's it's really d it it's a very simple thing, but it it basically just does something which tells you how specific a word is. In a sense it's in a it's it's basically just to to a high degree really telling you how r how rare a word is or how common a word is. But uh as as the first step that's probably for for a prototype for next week that's probably not a bad thing, I mean even if it's just that. Even if you just like if you have segments where where lots of rare words occur, highlighted in darker red than segments where all the very common words occur. That's just that's somewhere to start from. And it's it's a bit more sophisticated than that, but then de facto it just ends up doing that mostly, from what I figured out. Um Actually I think I'm not gonna not gonna start that now, because that's probably gonna take too long. So if we get it on a word by word basis whatever you do it'll probably appear in a word by word basis, and what you have on a sort of segment but not quite segment base. 'Kay, what about the following model? I mean this is a very unscientific way of doing it in some sense, but what if we if we take time as the standard unit for now and sort of like make a massive one segment split super-array. Because everything you're doing can in one way or the other be ti tied down to actual time. So if we if we if for if for each one segment time slot we could attach a value, and then it would be easy to then go back when if we have the time marks here, and re-map that onto onto the length of a segment, you know what I mean? 
So if you have for each word say and we know that word starts at this segment and ends at that segment, and you have for like for some time period w the overlap and that period on the F_ ones and that period or something. So you also have a value which can be tied down to a time. And then we could just in m in Matlab or in something just create some massive super-array of of things for each for each like sort of time sampling slot and and calculate a value for this, and once we have this array, we can go with the script and sort of go for each segment to the starting and end times and say okay, this is from our time segment f here to this time segment there. So we take the some of those and divide them by by the number or something, create the average and put it in as the value for the segment. You know what I mean? Oh. Tha that's that's a fiddling play in the end, but that problem we'll always have. Because we don't have a useful way of automatically evaluating um We don't have a way of usefully evaluating automatically what's good and what's bad, so it's it's probably always gonna be a question of looking at it and saying okay, like running it with different different factor loadings and seeing okay, this way it works this well and this way it works that well. But it probably be more difficult for mapped t to map for you to map it on so for you sort of i Yeah, but so you say that inst like you basically say w having an array where each each cell is is one like is o is one word.. And then you would map your information onto individual words. Hmm? Would you be able to find out which word that is and 'Kay, and and then we could have like some type of just point in the end where the one scores from the all the individual word cells get multiplied all with like for each utterance or whatever. You have get all multiplied with the same value all the ones that are within that utterance. That's sort of the combination of of the two scores. 
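[Editorial sketch] The time-based model described above (one value per fixed time slot in a large array, then for each segment take the slots between its start and end time, sum them and divide by their number) can be written as follows. The slot length, the tuple layout, and the function name are assumptions. The meeting mentions Matlab, and plain Python is used here only for illustration.

```python
import math


def segment_means(slot_values, segments, slot_len=1.0):
    """Average per-slot values over each segment's time span.

    slot_values: slot_values[i] covers [i * slot_len, (i + 1) * slot_len).
    segments: iterable of (segment_id, start, end), times in seconds.
    Returns {segment_id: mean value}, or None for a segment that lies
    entirely beyond the end of the array.
    """
    result = {}
    for seg_id, start, end in segments:
        first = int(start // slot_len)
        # Include every slot the segment touches, and at least one slot.
        last = max(first + 1, math.ceil(end / slot_len))
        window = slot_values[first:last]
        result[seg_id] = sum(window) / len(window) if window else None
    return result
```

Overlapping segments are not a problem for this approach, since each segment simply reads its own slice of the array.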
And then we'd have to go back again and then put that back into that segment mode here. So that we because in the end we don't want it on a per word basis, but probably on a per segment base. Oh, yeah yeah, actually that yeah, that's true, so that it's easier if you are able to yeah well, I think that's probably back where we started at this. But you said it wi but you said it's more complicated, because your segments aren't those segments exactly. Hmm. And their segments do overlap. Uh the tho those do or those don't? Mm-hmm. Okay. So you could on their granularity you could on their granularity create a score like for each for each of their segments. Well I guess I mean for you, if you know for each word, if you find that out, then it has to be possible, because if we know sort of this is going from word to word or it this is going from time to time and then there has to be a way then for you say okay, this concerns these following words, and then just make make a simple mean over them. I think an interesting thing is if we don't combine your two scores in the in the X_M_L_ file you had, but if we do that in the software, then we can probably make ways of playing with it in the software and sort of l you know like adapting some sort of control, like playing around with look like it's playing with different weightings for that the utterance-based one versus the word-based one and sort of look at it dynamically. You know what I mean? Like playing around, sort of figuring out what's the best way of combining them by playing around and looking at the results. Yeah. Yeah, we could probably like make a uh um graphic display initially, at least for our experimenting. We'd just place them in different ways and then see how they interact with each other. Okay. Yeah. 
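Keeping the two scores separate and combining them only in the software could be as simple as the sketch below. The linear blend is an assumption made for illustration (multiplication is also mentioned in the discussion), and `combine` and `weight` are hypothetical names for the control that would be adjusted while looking at the results.

```python
def combine(word_score, utterance_score, weight):
    """Blend two per-segment scores.

    weight = 1.0 uses only the word-based score,
    weight = 0.0 uses only the utterance-based score.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0.0 and 1.0")
    return weight * word_score + (1.0 - weight) * utterance_score

def combine_all(segments, weight):
    """segments: dict of segment id -> (word_score, utterance_score)."""
    return {seg_id: combine(a, b, weight) for seg_id, (a, b) in segments.items()}

example = {"seg1": (2.0, 4.0), "seg2": (0.0, 1.0)}
```

Because the blend is recomputed from the stored pair each time, changing the weight never requires regenerating the data files.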
So do you think y like both of you then can map something onto their segments, like just each of you provide one value, like double value or whatever, like one decimal value or whatever onto onto exactly their segments. It's stated both in It's done both in terms of words and in terms of segments. It's a bit sad sort of that we do this before we've truly figured out how the NITE X_M_L_ thing works, because now we're doing it all by hand and like parsing and un-parsing that thing and it's it's all part of the framework. Uh How much easier would it be if we truly understood this. So 'Kay, I might just change my order of in which I do things and like forget my latent semantic analysis stuff until the weekend and try to really make sense of the of the NITE data system now, so that maybe as soon as I've understood that we find ways of doing that in within the NITE framework already, so that we don't manually have to parse times and entire things together. Well, at the moment what you would do like to to solve this problem is you would sort of like write some Perl script or something that gets this time value out of here and Okay. Mm-hmm. Mm-hmm. So provided that you get your words in the right order, do you think it is an easy task for you to it it's a f relatively feasible task for you to to get just a single value per segment? Okay. And you say you think you're able as well to map onto those segments. And if you're both able to map into those segments, then we should be able to get one file where we have like whatever two values, value A_ and value B_ both as attributes for for this. And that we could load into a prototype and see what types of disp what ways of displaying this information are there. Hmm. Mm-hmm. Hmm. There's probably also social interaction factors in that there's sometimes just a meeting like if people adapt their F_ zero to each other, then they're sometimes p Yeah. 
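A single file carrying value A and value B as attributes on each segment could be produced roughly like this. The XML layout below is invented for the example and is not the actual NITE segments format; the attribute names `valueA` and `valueB` and the function `attach_scores` are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a segments file.
SEGMENTS_XML = """<segments>
  <segment id="seg1" starttime="0.0" endtime="2.5"/>
  <segment id="seg2" starttime="2.5" endtime="6.0"/>
</segments>"""

def attach_scores(xml_text, scores_a, scores_b):
    """Add valueA / valueB attributes to every segment that has a score."""
    root = ET.fromstring(xml_text)
    for seg in root.iter("segment"):
        seg_id = seg.get("id")
        if seg_id in scores_a:
            seg.set("valueA", f"{scores_a[seg_id]:.3f}")
        if seg_id in scores_b:
            seg.set("valueB", f"{scores_b[seg_id]:.3f}")
    return root

root = attach_scores(SEGMENTS_XML, {"seg1": 0.25}, {"seg1": 0.5, "seg2": 1.0})
output = ET.tostring(root, encoding="unicode")
```

Segments without a score are left untouched, so each person can add their attribute independently of the other.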
Can you not do something like just like not measuring the F_ zero or the amplitude at all, but just like the variance of F_ zero within a certain time frame and like sort of like just have some part for its very low variance with more the same F_ zero and one where there's a lot of more variance. I don't well actually I don't know about that at all. Yeah. Sorry. Mm. Hmm.. Okay. So at the moment you you just you measuring the F_ zeros relative to the average for the speaker or Mm-hmm. Mm-hmm. Mm-hmm. Ri so the average Okay. Um so what you would be feeding in would be just like one value per speaker per meeting, so that oka that that's that's your average baseline, okay okay. No yeah, that that makes m yeah, that makes a lot more sense, yeah. So that would show you how much relative to how he sort of how he's performing generally in that meeting relative to that how he's in a specific segment. He or she. Yeah. Yeah. And for you that would be quite easy to translate into those segments. Yeah, guess I've asked this question fifteen times now. Sorry. Uh Hmm. Oh they w they do exist. The Castrati corps of the International Computer Science institute. Oh well. I'm afraid I have to go soon. But not quite sure, I mean is there anything more we have to talk about anyway? Yeah. See, as soon as as soon as I'm halfway through my L_S_A_ like basically as soon as I have the the matrix built, if of the document by word stuff, it's very easy to then calculate for each word a score, and I can just give you those scores and you can do with them whatever you want. Yeah, I think Colin, Dave and me will actually work on on the Java stuff, and it and we'll just see whatever you whatever you supply us, we'll try to tie in and visualise in some way or another. Um I'll ask Jonathan if we can postpone the meeting to one o'clock. So that would give us a chance of meeting in for an hour before that to discuss the questions that we had. 
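The speaker-relative F0 measure, and the variance alternative raised at the start of this stretch, can be sketched as below. The input layout (a dict of segment id to voiced F0 frames in Hz for one speaker in one meeting) and all function names are assumptions; how the F0 values are actually extracted is not covered here.

```python
from statistics import mean, pvariance

def speaker_baseline(f0_by_segment):
    """Mean F0 over all frames of one speaker in one meeting."""
    frames = [f for seg in f0_by_segment.values() for f in seg]
    return mean(frames)

def relative_f0(f0_by_segment, baseline):
    """Per-segment mean F0 divided by the speaker's baseline (1.0 = typical)."""
    return {seg: mean(frames) / baseline
            for seg, frames in f0_by_segment.items() if frames}

def f0_variance(f0_by_segment):
    """Per-segment F0 variance, ignoring the absolute pitch level."""
    return {seg: pvariance(frames)
            for seg, frames in f0_by_segment.items() if frames}

f0 = {"seg1": [100.0, 100.0], "seg2": [150.0, 250.0]}
baseline = speaker_baseline(f0)
```

Dividing by the baseline gives one number per segment that says how the speaker compares with his or her own average in that meeting.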
Yeah, I haven't gotten like sort of my confirmation that w Wednesday is fine. I'm not sure if I'm supposed to expect a confirmation for my confirmation from him now or but I'll just email him again and s ask him if we can maybe make it one sorry, we said twelve and I'm asking if he can make one. This is a bit frustrating at the moment, this project, isn't it? It's so like difficult to to get to the point where you understand enough to really feel that. I don't know, I'm not feeling that I'm really working at the moment, I'm more just trying to make sense of everything. And it's a bit too far into the meeting for that, and into the project for that. Hmm. I guess as soon as we have a framework in the w in the type of the prototype where like sort of each of us can tie in their stuff and see what it l how how it looks like and how it performs, you know. That probably makes it a lot easier then, but it's sort of it's a boot-strapping problem, like for the prototype we need some type of data, but to develop the data it would be a lot easier to to have the prototype. Anyway, I gotta go. Hmm? Yes, but if it's if it's in a form which is easy to read in at the moment, that would be fine. Sort of like if basically if we have something like this segments file, but for each of you like just have one attribute. I think it's really easy if we don't merge them before-hand, but if we let them if we combine them in the prototype or don't at the moment, because then we can easily d display them individually, c contrast them to each other and play around with how y to combine them. Exactly, yeah. Yeah. And I mean computationally multiplying two integers or doubles or whatever shouldn't be the thing that slows us down. Yeah. Alright.
Yeah. You you have this going behind your ears, the yeah. Yeah. Hmm. No. Just maybe talk about um how you would give me your data file. Yeah. That's um the interface between having the topic segments and calculating the uh information importance of words. Yeah. So I I would need separate files for each segment or just maybe have delimiters inbetween each segment. Okay, yeah, and y then I can make files of that. Alright. No, I don't kno know it, no. Yeah, that's fine. I just I just need a a title for the file, but doesn't matter what it is. Mm. So uh we have then when we are mixing our um values together, is it um a value for each um expression, for each sentence or something? Yeah. Mm-mm. Yeah. Yeah, I would just end up with uh values for each word, so I wouldn't have um any boundaries for segments at all. I just I just would have the words, and then how we um how long each expression would be, I don't know. Yeah, but what is an utterance. Uh, yeah. Okay. So what we what we would fit in into the XML files, would it be uh um a value for each utterance? Or what you Okay. Mm-hmm. So then we would have to fit in more than just one where values um Yeah. But um Mm uh-huh. Yeah, it's better. So we n would need a value label for each of those segments that are not dialogue acts. For all those that point to the actual words. Yeah, that's there. No, above. Yeah. Yeah. So you mean an a an attribute. Yeah. Wouldn't it be easiest to just incorporate a new attribute in in this file? Yeah, I don't know, that's probably not very uh nice but would quite easy, wouldn't it? Yeah. But they display the segments in or that the utterances i in their um user interface as well, where the words are y um displayed. So so they have they have their utterances displayed in their interface. So that's they um access probably this file and then they display the words. So why shouldn't we be able to do that? 
Uh Yeah, the problem is probably that um I extract all the words and then uh I don't um use an I_D_ or something for it, so it would be difficult to write it back to the right position. It depends on the position, I think. Yeah. Yeah, I I could use another a different approach, though. It Mm yeah. Mm-hmm. Um would you think it would make sense to just take um file files by speaker. It's it doesn't matter what input I give to um to Rainbow. So I just could g use the files as they are. I don't know if that's that would give an output. I don't know. No, it um I think it's not uh versus the whole corpus. It's um you have certain categories, and you measure which words um have the highest information for one category. So it's the categories across each other I think. Um probably yeah. But yeah, it's it's s strange. Yeah. Yeah, but what what the problem is, I don't know exactly. I think the information gain in Rainbow is ordered by the value of the information gain, not I am not sure if I can get the right order and the values. That's the problem. But I um if the order stays the same, it's no problem at all to i just write back again. But if uh it's ordered by information gain, I don't know where the words come from, because it's it has a bag of words representation. Uh yes. Is it is. It was made for text classification. Yeah, that's the problem. Yeah, um what it actually does is that you you put in some documents, and you have several documents per category. Um and you have several categories. And then it measures um which words are typical for a certain t category. And if you get a new document, it will um compare which words are in that new document. And if there are a lot of words that um are typical for one category, it will assign it to that category, and if it's typical for another one, it will assign it to that one. Yeah, it's because you have um y um you have um oth different files. 
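One possible way around the lost word order, offered only as an assumption and not as something the speakers settle on: if the score belongs to the word type rather than to a position, the ordered list of word IDs can be kept separately and the scores looked up afterwards. The names `scores_by_token` and `type_scores` and the score values are made up for the example.

```python
def scores_by_token(tokens, type_scores, default=0.0):
    """Map per-word-type scores back onto tokens in transcript order.

    tokens: list of (word_id, word) pairs in their original order.
    type_scores: dict of lowercased word -> score, in any order.
    """
    return [(word_id, type_scores.get(word.lower(), default))
            for word_id, word in tokens]

tokens = [("w1", "The"), ("w2", "remote"), ("w3", "the")]
type_scores = {"remote": 0.92, "the": 0.01}  # e.g. sorted by score, order lost
mapped = scores_by_token(tokens, type_scores)
```

The order of the scored list then no longer matters, because the IDs carry the position.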
If if this is your directory, you have um um a diagra um, a directory one, two and three. And this represents the category. Everything that's in there is in category one. A Yeah, just split it up somehow, yeah. Yeah. No, I've just um split them up uh somehow um by um there are several documents for um each meeting, and I just put one in each category. So uh it's it's not very sensible at the moment, because I'm waiting for the um topic segments or I'm just yeah. Yeah, yeah, probably. Yeah, maybe it's it's better if I write it myself, because otherwise it's too easy to just split things up into bins, and it it wouldn't be any work at all in terms of programming or something. Alright. Mm-hmm. Mm-hmm. How does it calculate that actually? Okay. Yeah, but that's the sa uh almost I think similar to what I'm doing, because words that are in every class, that are not very informative, but words that are only in one class are are very informative for that class. Oh okay. Alright. Yeah, that's th yeah. For a specific word per per All over the place. And in my case it would have I think it would have different uh values for each category. But in that category it would have the same value at each place. Do you know what I mean? Yeah. Um that would be best, but I I would have to look if that. Um at least it works if there are several categories with each with one document each, but it um yeah. Yeah, that's the questio yeah, I don't know. Mm-hmm. Mm okay. Oh no, I li I th I thought something else that we d um we just split I need somewhere to split, and that um splitting at category boundaries um splitting at topic boundaries would a nice thing to do, rather than just splitting somewhere. So yeah, yeah. Yeah, but it I think it should work for just one document, because it it compares between the categories, and if you just have one document um it still can find out which words are informative for that category and which are not. 
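How informative a word is across categories can be illustrated with the textbook information gain of word presence. This is a sketch of the general measure only; it is not confirmed to be what Rainbow computes internally, and the document layout and function names are assumptions.

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, word):
    """docs: list of (category, set_of_words). Gain of knowing whether word occurs."""
    categories = sorted({cat for cat, _ in docs})

    def class_counts(subset):
        return [sum(1 for cat, _ in subset if cat == c) for c in categories]

    with_word = [d for d in docs if word in d[1]]
    without = [d for d in docs if word not in d[1]]
    n = len(docs)
    conditional = (len(with_word) / n) * entropy(class_counts(with_word)) \
        + (len(without) / n) * entropy(class_counts(without))
    return entropy(class_counts(docs)) - conditional

# One document per category, as in the directory layout described above.
docs = [("1", {"the", "remote"}), ("2", {"the", "budget"}), ("3", {"the", "battery"})]
```

A word that occurs in every category gains nothing, while a word confined to one category gains the most, which holds even with a single document per category.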
So if you have just one document in each category, and there are a lot of um occurrences of the word the in each one, so this word will be not very informative across the categories. So it should work for one, but I'm not sure if how exactly it it um calculates everything. Uh I think it's not possible to look that up. Yeah, maybe it's possible to have a list that it's o that is ordered by Yeah, don't know. Mm-hmm. Um I don't know what you mean by that. It's uh Mm-hmm. Oh right. Mm. Alright. Yeah, you you can use um some kind of um truncation maybe if you attach a number to each word and say that it should omit the last part. Uh probably not. No, no. Yeah, maybe I should try something different and just programme it for myself. Because um Mm. Mm-hmm. Mm-hmm. Yeah, but how can we get to know what makes sense as a a function for joining everything together? Might be difficult to find that out. Yeah. Mm-hmm. You you could always find out how many words there are in an utterance, couldn't you? Yeah. So For example. Mm. So but Mm. Mm-hmm. Mm. Yeah, then we sh yeah. We could even yeah, we could even have a look at our different measures, if they um come up with the same same kinds of yeah. Uh I think if I can provide something for the words, as they are It's stated from where to where the segments go. It should be uh should be possible. Yeah, so it must should be possible. Yeah. Mm-hmm. You mean by matching strings. Or what? Yeah, but what I've done is um a parser, an X_M_L_ parser where you can get the start times. So, yeah, that's quite easy, because it's it's an attribute and you just s say that you want the values of those attributes. I think that should be possible. I don't know how long it takes me. But Mm. Is laughter annotated at all? Because you could take that on ours out maybe. If you take the laughter out and then calculate it. Mm. Yeah. Not the moment. Maybe if we meet at the weekend. Oh I don't know. Yep. So uh no, how about the the prototype. 
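Reading the start times out as attribute values, and using them to find which words fall inside a segment, might look like the sketch below. The element and attribute names (`w`, `starttime`, `endtime`) and the sample data are assumptions for illustration and may not match the real corpus files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a words file.
WORDS_XML = """<words>
  <w id="w1" starttime="0.00" endtime="0.35">okay</w>
  <w id="w2" starttime="0.35" endtime="0.90">remote</w>
  <w id="w3" starttime="2.60" endtime="3.10">battery</w>
</words>"""

def word_times(xml_text):
    """Return (id, start, end, text) for every word element."""
    root = ET.fromstring(xml_text)
    return [(w.get("id"), float(w.get("starttime")), float(w.get("endtime")), w.text)
            for w in root.iter("w")]

def words_in_segment(words, seg_start, seg_end):
    """IDs of the words that lie completely inside the segment's time span."""
    return [wid for wid, start, end, _ in words
            if start >= seg_start and end <= seg_end]

words = word_times(WORDS_XML)
```

With the times available, matching by time span avoids having to match strings.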
If we want to show him something on Monday, we definitely have to work together. Some of us at least have to work together to get it running probably. Alright. Yeah, me too. ... Yeah, and also we don't have to recalculate if just one Okay.
Looks good. Yeah. Okay. Yeah, that's true I think. Oh yeah. Yeah. Yeah, we haven't got that much to talk about, I don't think. Yeah. Yes, we can do that. Alright. Yeah well, that that's the way it works. It's got delimiters in-between the mar the boundaries. But Yeah, if you're happy to do that, yeah. And you're saying you need a a label for each segment. But I mean the most I ca you don't need a one. Well, I could give you an I_D_, but it'd be quite difficult to actually give a topic a title. Right. Okay. Yeah, that's no problem. Hmm. Mm. Um is it not segments? Or is it dialogue act dialogue acts? Yeah. I think that's it, yeah. Yeah, they've all got an I_D_ for each utterance. Yeah, and it's got a different uh file for each speaker. N no, it just points to the to the words. Yeah, yeah. The I suppose so, the utterances. But also includes things like like Yeah, think so. Mm. Yeah, can you not just leave those lines blank? Oh alright. Okay. Hmm. Yeah, what did you use to make that file? Just in Perl? You didn't use an X_M_L_ parser? Okay. Hmm. Hmm. Hmm. Well like function words and stuff. Mm. Hmm. Well, um when it splits the topics up, it does do it on regular words that um that occur. But it doesn't tell you what they are, no. Hmm. Yeah. Yeah. Yeah, you can't really get any other output. No. No. Mm-hmm. Yeah, it's supposed to split it into cu coherent topics with the similar information. Hmm. Hmm. Yeah. Hmm. Hmm. Hmm. Hmm. Hmm. Yeah. Oh I'm sure it'd be quite straightforward, some of the tasks we have to do. But yet no one really understands them, the actual X_M_L_ parsing. Hmm. Yeah, that sounds like an idea. Yeah. Yeah, just send us an email and tell us what's happening. Right.
